Bayesian Simultaneous Edit and Imputation for Multivariate Categorical Data

نویسندگان

  • Daniel Manrique-Vallier
  • Jerome P. Reiter
چکیده

In categorical data, it is typically the case that some combinations of variables are theoretically impossible, such as a three year old child who is married or a man who is pregnant. In practice, however, reported values often include such structural zeros due to, for example, respondent mistakes or data processing errors. To purge data of such errors, many statistical organizations use a process known as edit-imputation. The basic idea is first to select reported values to change according to some heuristic or loss function, and second to replace those values with plausible imputations. This two-stage process typically does not fully utilize information in the data when determining locations of errors, nor does it appropriately reflect uncertainty resulting from the edits and imputations. We present an approach that integrates editing and imputation for categorical microdata with structural zeros. We rely on a Bayesian hierarchical model that includes (i) a Dirichlet process mixture of multinomial distributions as the model for the underlying true values of the data, with support restricted to the set of theoretically possible combinations, (ii) a model for latent indicators of the values that are in error, and (iii) a model for the reported responses for values in error. We illustrate this integrated approach using simulation studies with data from the 2000 U. S. census, and compare it to a two-stage edit-imputation routine. Supplementary material is available online.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Imputation of parent-offspring trios and their effect on accuracy of genomic prediction using Bayesian method

The objective of this study was to evaluate the imputation accuracy of parent-offspring trios under different scenarios. By using simulated datasets, the performance Bayesian LASSO in genomic prediction was also examined. The genome consisted of 5 chromosomes and each chromosome was set as 1 Morgan length. The number of SNPs per chromosome was 10000. One hundred QTLs were randomly distributed a...

متن کامل

Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models with Local Dependence

We present a nonparametric Bayesian joint model for multivariate continuous and categorical variables, with the intention of developing a flexible engine for multiple imputation of missing values. The model fuses Dirichlet process mixtures of multinomial distributions for categorical variables with Dirichlet process mixtures of multivariate normal distributions for continuous variables. We inco...

متن کامل

Missing data imputation in multivariable time series data

Multivariate time series data are found in a variety of fields such as bioinformatics, biology, genetics, astronomy, geography and finance. Many time series datasets contain missing data. Multivariate time series missing data imputation is a challenging topic and needs to be carefully considered before learning or predicting time series. Frequent researches have been done on the use of diffe...

متن کامل

Nonparametric Bayesian Multiple Imputation for Incomplete Categorical Variables in Large-Scale Assessment Surveys

In many surveys, the data comprise a large number of categorical variables that suffer from item nonresponse. Standard methods for multiple imputation, like log-linear models or sequential regression imputation, can fail to capture complex dependencies and can be difficult to implement effectively in high dimensions. We present a fully Bayesian, joint modeling approach to multiple imputation fo...

متن کامل

Results of Evaluation of AGGIES for ACES

The U. S. Census Bureau’s Annual Capital Expenditures Survey (ACES) collects data about domestic capital expenditures in non-farm businesses operating within the United States. Analysts manually edit the ACES data using a specified set of editing rules. Although individual edits are straightforward, the hierarchical combination of edits are complicated with several nested levels of simultaneous...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015